Extract Information (Text Processing)
Synopsis
This operator extracts information from a document with structured content.
Description
This operator extracts information from the structured content of a document. The extracted information is added to the document as meta data and can, if desired, be turned into attributes later. Several options are available for specifying which information should be extracted. In String Matching mode you specify a start string and an end string; if both are found in the document, the characters between them are extracted. Regular Expression mode lets you specify any expression and uses the first matching group as the extracted value. If it is too difficult to cover the intermediate characters in a single well-defined expression, you might find Regular Region mode useful, where you define two regular expressions. As in String Matching mode, the first expression defines the start and the second the end, and everything in between is extracted. The most sophisticated variant is XPath mode, where you can enter an arbitrary XPath expression. This proves especially useful when extracting information from a website. Since XPath expressions are only available for XML files, you have to make sure that the documents are well-formed XML. This can be ensured with the assume_html parameter of the Document Processing operator, which uses a special parser to correct errors in the HTML. It is also possible to extract information from a JSON document with a JSONPath expression. As with XPath mode, you have to make sure that the document provided is valid JSON.
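The three text-based modes described above can be illustrated with a minimal Python sketch. This is not the operator's implementation: the function names and the sample document are hypothetical, and the sketch assumes that the end marker is searched for only after the start marker, which the description does not state explicitly.

```python
import re


def extract_between(text, start, end):
    """String Matching sketch: characters between start and end, or None."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    # Assumption: the end string is searched for after the start string.
    j = text.find(end, i)
    if j == -1:
        return None
    return text[i:j]


def extract_first_group(text, pattern):
    """Regular Expression sketch: value of the first matching group, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def extract_region(text, start_pattern, end_pattern):
    """Regular Region sketch: text between a start match and an end match."""
    start = re.search(start_pattern, text)
    if start is None:
        return None
    # Assumption: the end expression is matched after the start match ends.
    end = re.compile(end_pattern).search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()]


# Hypothetical sample document
doc = "<title>Weekly report</title><price>Price: 42 EUR</price>"

title = extract_between(doc, "<title>", "</title>")
price = extract_first_group(doc, r"Price: (\d+)")
region = extract_region(doc, r"<price[^>]*>", r"</price>")
```

Here the Regular Region call avoids having to describe the content of the price element inside a single expression, which is the situation the description mentions.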
Input
- document
Output
- documents (Collection)
Collection of the extracted information.
Parameters
- query type Specifies the type of the query. The available query types are: String Matching, Regular Expression, Regular Region, Indexed, XPath and JSONPath; Range: selection
- string matching queries Specifies a list of string matching start and end sequences. Everything between will be used as result. See the operator documentation for details on string matching. Range: list
- attribute type Specifies the type of the resulting attributes. If numerical or binominal is chosen, ensure that the returned result can be interpreted as that type. The available types are: Nominal, Numerical and Binominal; Range: selection
- regular expression queries Specifies a list of attribute names and their corresponding regular expressions. The first matching group is used as value. See the operator documentation for details on regular expressions. Range: list
- regular region queries Specifies a list of attribute names and their corresponding regular expressions. Two regular expressions might be specified in order to define the start and the end of a region. Everything in between the two matches will be delivered as result. Range: list
- xpath queries Specifies a list of attribute names and their corresponding XPath queries. See the operator documentation for details on XPath. Range: list
- namespaces Specifies pairs of identifier and namespace for use in XPath queries. The namespace for (X)HTML is bound automatically to the identifier h. Range: list
- ignore CDATA Indicates whether CDATA should be ignored when evaluating the XPath expression. Range: boolean
- assume html If checked, a more tolerant XML parser is used, which copes with forbidden HTML constructions but always assumes HTML and adds missing tags. Uncheck this for plain XML. Range: boolean
- index queries Specifies a list of attribute names and the regions. Regions are specified as offset index and length of the match. Range: list
- jsonpath queries Specifies a list of attribute names and their corresponding JSONPath queries. Range: list
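
The structured query types in the parameter list can be approximated with the Python standard library. The sketch below is only an illustration under stated assumptions: the sample documents and the helper name are hypothetical, the offset in the index query is assumed to be zero-based, and the JSON lookup is plain dictionary access standing in for the JSONPath query `$.store.book[0].title`, since the standard library has no JSONPath evaluator. The XPath query uses the prefix h for the XHTML namespace, mirroring the automatic binding described for the namespaces parameter.

```python
import json
import xml.etree.ElementTree as ET

# Prefix h bound to the XHTML namespace, as described for the namespaces parameter
XHTML_NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical well-formed XHTML document
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Weekly report</title></head>"
    "<body><p>Price: 42 EUR</p></body></html>"
)
root = ET.fromstring(xhtml)

# XPath-style query; ElementTree supports only a subset of XPath
page_title = root.findtext(".//h:title", namespaces=XHTML_NS)

# Hypothetical JSON document; the lookup mirrors $.store.book[0].title
data = json.loads('{"store": {"book": [{"title": "Dune", "price": 9.5}]}}')
book_title = data["store"]["book"][0]["title"]


def extract_by_index(text, offset, length):
    """Index query sketch: region given by offset and length (zero-based offset assumed)."""
    return text[offset:offset + length]


record_id = extract_by_index("ID:12345;OK", 3, 5)
```

If the XHTML string were not well-formed, `ET.fromstring` would raise a parse error, which is the reason the description asks for well-formed XML before XPath queries are applied.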